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Abstract 

Background: Bacillus anthracis, the causative agent of anthrax, is a globally distributed zoonotic pathogen that 
continues to be a veterinary and human health problem in Central Asia. We used a database of anthrax outbreak 
locations in Kazakhstan and a subset of genotyped isolates to model the geographic distribution and ecological 
associations of B. anthracis in Kazakhstan. The aims of the study were to test the influence of soil variables on a 
previous ecological niche based prediction of B. anthracis in Kazakhstan and to determine if a single sub-lineage of 
B. anthracis occupies a unique ecological niche. 

Results: The addition of soil variables to the previously developed ecological niche model did not appreciably alter 
the limits of the predicted geographic or ecological distribution of B. anthracis in Kazakhstan. The A1.a experiment 
predicted the sub-lineage to be present over a larger geographic area than did the outbreak based experiment 
containing multiple lineages. Within the geographic area predicted to be suitable for B. anthracis by all ten best 
subset models, the A1.a sub-lineage was associated with a wider range of ecological tolerances than the outbreak- 
soil experiment. Analysis of rule types showed that logit rules predominate in the outbreak-soil experiment and 
range rules in the A1.a sub-lineage experiment. Random sub-setting of locality points suggests that models of B. 
anthracis distribution may be sensitive to sample size. 

Conclusions: Our analysis supports careful consideration of the taxonomic resolution of data used to create 
ecological niche models. Further investigations into the environmental affinities of individual lineages and sub- 
lineages of B. anthracis will be useful in understanding the ecology of the disease at large and small scales. With 
model based predictions serving as approximations of disease risk, these efforts will improve the efficacy of public 
health interventions for anthrax prevention and control. 
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Background 

Anthrax is a disease of wildlife, livestock and humans 
that remains a public health problem throughout the 
world. Bacillus anthracis, the causative agent of anthrax, 
is a soil-borne, spore-forming bacterium which persists 
in soil for long periods of time under appropriate condi- 
tions [1]. Certain soil parameters, including pH, organic 
content and calcium, may be associated with spore sur- 
vival [1-5]. Anthrax outbreaks among livestock and 
wildlife result from exposure to these spores and are 
possibly influenced by climatic and physiological events 
[6,7]. In endemic areas, human cases of anthrax primar- 
ily result from contact with infected livestock during 
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slaughter or butchering [8,9] and control of livestock 
disease through vaccination and active surveillance of 
livestock and wildlife is essential for preventing human 
disease [10]. However, widespread active surveillance is 
costly and vaccination of every animal is not feasible. It 
is far more practical to focus these efforts on areas of 
high risk. To identify these it is necessary to improve 
our understanding of the ecology of B. anthracis 
through which animal infection occurs. 

The ecology of a pathogen such as B. anthracis can be 
explored using similar tools as those used for species 
distribution modelling and conservation planning. For 
example, ecological niche modelling (ENM) has been 
used to predict the potential ecological and geographic 
distribution of pathogens based on outbreak locations 
[10-14], presence of disease vectors [15-17] and disease 
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reservoirs [18]. The ecological niche of a pathogen, as 
for other types of species, is conceptualized as the N- 
dimensional hypervolume of ecological parameters 
within which the species can be maintained without 
immigration [19,20]. Various approaches to ENM iden- 
tify non-random associations between a species' locality 
data and environmental parameters. Ecological niche 
modelling experiments of B. anthracis are particularly 
useful considering the potential associations between 
spore survival and ecological conditions [1,5]. Results 
can be used as a proxy for disease risk and integrated 
into focused surveillance strategies for wildlife and live- 
stock in endemic areas and into vaccination strategies 
that target at risk herds before and during outbreak 
events [10]. 

Recently, studies of disease ecology have combined 
molecular genotyping techniques and ecological niche 
modelling to provide evidence that genetic lineages of a 
pathogen can have different environmental associations 
and potential geographic distributions [12,21]. In general 
B. anthracis has relatively limited global diversity. How- 
ever, multiple locus variable number tandem repeat ana- 
lysis (MLVA) systems for B. anthracis can differentiate 
strains into distinct lineages and sub-lineages [22-24]. 
Analyses of a global collection of B. anthracis isolates 
suggests that the A lineage is globally distributed, while 
other lineages (B and C) are geographically restricted. 
These findings may be explained by adaptive differences, 
some of which carry fitness costs that limit abundance 
and distribution of certain lineages or sub-lineages 
[3,24]. The ecological niche of B. anthracis has been 
modelled in the United States and Kazakhstan using 
locations of reported outbreaks [10,11,25,26]. A stated 
limitation of these experiments was that the outbreak 
data potentially included multiple strains of B. anthracis 
[10,11]. If lineages of B. anthracis do exhibit niche spe- 
cialization and unique geographic distributions, then it 
is plausible that current outbreak based ecological niche 
models are biased toward a dominant strain in a parti- 
cular landscape. It would then follow that single lineage 
models may better predict presence of the pathogen at 
local scales and increase the value of public health mea- 
sures [10]. 

Kazakhstan is situated in Central Asia, a region with 
some of the highest reported human anthrax incidence 
and mortality rates in the world [27,28]. The majority of 
human anthrax cases in Kazakhstan are related to expo- 
sure to infected livestock or handling of products 
derived from infected livestock [9]. In rural areas of 
Kazakhstan, veterinary care and surveillance programs 
are limited by the country's large land mass and widely 
distributed rural populations. Vaccination of livestock 
occurs mainly in response to detected outbreaks. In 
countries such as Kazakhstan, prioritizing areas for 



vaccination and surveillance are necessary for disease 
control. Our group recently created a multi-variate eco- 
logical niche model to characterize the broad environ- 
mental conditions that support B. anthracis across 
Kazakhstan [11,26]. In a parallel effort, Aikimbayev et al. 
used an eight marker MLVA typing system (MLVA-8) 
to describe the diversity of B. anthracis within Kazakh- 
stan from 88 archival strains [22,29]. 

In this study, we first expanded on the previously pub- 
lished outbreak based modelling experiment by adding 
four soil variables (pH, calcium levels, organic content 
and baseline water saturation) to the original set of 
environmental variables. Despite literature suggesting a 
strong relationship between soil characteristics such as 
high calcium levels and alkaline pH and spore persis- 
tence [1,4,5], the influence of available soil variables on 
B. anthracis ENM predictions has not been compara- 
tively examined [10]. We next used these twelve envir- 
onmental variables and the collection of MLVA-8 
genotyped samples to create an Al.a sub-lineage specific 
ecological niche model for Kazakhstan. 

Results 

Accuracy Metrics 

Ecological niche modelling was performed using the 
Genetic Algorithm for Rule-Set Prediction (GARP). Four 
experiments were run (outbreak-soil, Al.a sub-lineage, 
small southern outbreak and large southern outbreak) 
and are summarized in Table 1. All modelling processes 
reached convergence of accuracy (0.01) prior to reaching 
the maximum iteration setting (= 1,000). The outbreak- 
soil model had an Area Under the Curve (AUC) of 
0.7188 and was significantly different from a random 
model. Total omission of the outbreak-soil model was 
2.6% and average omission was 9.9%, indicating that 
97.4% of the testing points were predicted by at least 
one best subset model and 89.1% were predicted by all 
models. The AUC of the Al.a sub-lineage model was 

0. 6964 and was also significantly different from a ran- 
dom model and had a total and average omission of 0 
and 13.1, respectively. Both the large and small outbreak 
models had AUCs significantly different than random. 
Accuracy metrics for all models are shown in Table 2. 

Predicted Distributions of B. anthracis 

Locations used for input into GARP are shown in Figure 

1. Based on areas of agreement of a minimum of six of 
the best subset models, the outbreak-soil experiment 
predicted B. anthracis across much of northern Kazakh- 
stan and in a narrow band of the southeast. The interior 
of the country, which is primarily arid, was not pre- 
dicted to be suitable for the pathogen. The results are 
similar to those of the experiment without the soil vari- 
ables with respect to the geographic extent of areas of 
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Table 1 Summary of experiments 


Experiment 


External Data Split (%Training/% 


Area used for Model 


Locality Data 




Testing) 


Building 




Outbreak-Soil 


85/15 


All Kazakhstan 


All spatially unique livestock outbreaks 


A1.a Sub-lineage 


80/20 


Southern Polygon 


Spatially unique A1.a isolates in southern polygon 


Small Southern 


85/20 


Southern Polygon 


Random sub-set of spatially unique livestock outbreaks in 


Outbreak 






southern polygon 


Large Southern 


80/15 


Southern Polygon 


All spatially unique livestock outbreaks in southern polygon 


Outbreak 









six or more best subset model agreement (Figure 2). 
The outbreak-soil experiment expanded two areas in the 
north which had lower model agreement. The A La sub- 
lineage experiment predicts a more extensive geographic 
distribution than that of the outbreak experiment, 
including areas in the northern interior and western 
portions of the country (Figure 3). The northern pockets 
of less suitable geographic areas seen in the outbreak- 
soil experiment were predicted to be unsuitable based 
on agreement of six or more best subset models. The 
overall extents of the geographic predictions of the two 
experiments were more similar in the south than in the 
north. The large and small southern outbreak experi- 
ments both predicted similar geographic extents as the 
outbreak-soil experiment (Figure 4). All three projected 
experiments (Al.a sub-lineage, large southern outbreak 
and small southern outbreak) were run ten additional 
times using random external data splits. The subsets of 
Al.a sub-lineage and small southern outbreak experi- 
ments showed greater degrees of spatial heterogeneity 
than did the large southern outbreak experiment set. 
(see Additional File 1: Random Subsets for illustration, 
available as a PDF file, and Additional File 2: Accuracy 
metrics for random subsets, available as a PDF file). 

Each GARP model is composed of 50 if-then type rules 
(logic, range, negated range and atomic) which predict the 
presence or absence of the species for each pixel. Rule 
types for the ten best subset models from the outbreak- 



soil and Al.a sub-lineage experiments were extracted and 
are summarized in Table 3. Just over half of the outbreak- 
soil experiment rules were logit and no atomic rules were 
included, whereas range rules made up over 60% of the 
Al.a sub-lineage experiment rule types and this experi- 
ment included four atomic rules in the best subsets. 
Between 6 and 13 rules defined greater than 90% of areas 
predicted to be suitable for B. anthracis for each of the 
best subsets. Of the 95 rules which predicted the majority 
of the landscape in the outbreak-soil experiment, the 
majority (83%) were presence rules and of these 62% were 
range rules. The Al.a sub-lineage experiment had 99 total 
rules predict the majority of the landscape; all but one of 
these was a presence rule and 73% of the presence rules 
were range rules. The environmental tolerances described 
by the dominant rules suggest that mean NDVI, altitude, 
mean temperature, minimum soil calcium and minimum 
soil organic content are limiting variables for B. anthracis 
in Kazakhstan (Figure 5). Median minimum values of 
mean NDVI, NDVI amplitude, annual precipitation, dry 
month precipitation, wet month precipitation, mean tem- 
perature, altitude and soil organic content are significantly 
different between the Al.a sub-lineage and the outbreak- 
soil experiment using the Wilcox-Mann- Whitney test at a 
95% significance level. Median maximum values of NDVI 
amplitude, mean temperature, dry month precipitation, 
altitude, soil base saturation and soil organic content differ 
between the two experiments. 



Table 2 Sample sizes and accuracy metrics for GARP model building and evaluation Table 3 



Model 



Metric 


Outbreak-Soil 


Al.a sub-lineage 


Large Southern Outbreak 


Small Southern Outbreak 


N to build models* 


218 


26 


113 


26 


N to test models 


39 


13 


145 


147 


Total Omission 


2.6 


0.0 


0.0 


0.0 


Average Omission 


9.9 


13.1 


19.1 


15.5 


Total Commission 


32.7 


19.18 


12.74 


17 


Average Commission 


58.4 


66.11 


49.26 


56.35 


AUCt 


0.7188 


0.6964 


0.7401 


0.7386 


SE 


0.0466 


0.0817 


0.2410 


0.04 


Z 


90.94 


4.4449 


16.3284 


16.6241 



*N was divided into 50%training/50% testing for each model iteration 
tAUC = area under the curve 
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Outbreak Locations 

• Training n=21 8 
O Independent n=40 




Sample Locations N 

□ A3b ^ 

O A1 .a Independent n=1 3 
• A1 .a Training n=26 



Figure 1 Map of Kazakhstan with anthrax locality data. (A) Locations of genotyped isolates. Inset shows locations of outbreaks used for the 
full outbreak model. Green points indicate training data for input into GARP and yellow points indicate independent points used to evaluate 
model accuracy. Shaded area indicates southern polygon used for creating projected models. 



Values of the limiting variables were extracted from 
areas of ten best subset model agreement and plotted in 
two dimensional variable space. The Al.a sub-lineage 
experiment showed a broader ecological envelope than 
the outbreak-soil experiment based on areas of ten best 
subset model agreement, despite the smaller geographic 
area predicted by agreement of all ten models (Figure 
6). The two A4 locations, which are distant from each 
other geographically, are found within a narrow range of 
mean NDVI and mean temperature, but occupy nearly 
opposite ends of the range of precipitation values. 
Finally, the A3.b location was associated with ecological 
conditions towards the outer boundaries of the ecologi- 
cal envelope predicted by the outbreak-soil experiment. 

Discussion 

This study assesses the addition of soil variables to a 
previously developed ecological niche model for Bacillus 
anthracis and is the first known to model the ecological 
and geographic distribution of a single sub-lineage of B. 
anthracis. Inclusion of available soil variables into our 



anthrax outbreak model resulted in subtle changes in 
the likelihood of the pathogen in areas of northern 
Kazakhstan, but did not substantially change the extent 
of geographic predictions or results of rule set analyses 
[26]. The areas predicted as less suitable by the out- 
break-soil model correspond to regions of locally differ- 
ent values for all four soil variables (see Additional File 
3: Soil Variables, available as a PDF document). How- 
ever, it is not known whether these areas represent a 
unique ecological region or if measurement in these 
areas was affected by error or bias. The low minimum 
soil calcium association found in the rule set analyses 
contrasts with previous literature suggesting that B. 
anthracis spore persistence is associated with high soil 
calcium levels [3,30]. The results of our rule set analysis, 
however, are not directly comparable to previous work 
in that different units of measurement and sampling 
techniques were used. In addition, the soils data avail- 
able for this study had a relatively coarse resolution of 1 
km and were further aggregated to 8 km to match other 
climatic data for model development. As a result, fine 
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Figure 2 Predicted distribution of Bacillus anthracis in Kazakhstan. Predicted distribution of Bacillus onthracis in Kazakhstan based on 
outbreak data with and without soil variables. (A) Outbreak experiment (excluding soil variables) [26], (B) Outbreak-soil experiment (including soil 
variables), (C) Differences between distributions predicted by the two experiments. 



resolution relationships between soil and anthrax occur- 
rences would likely be missed in this experiment. 
Improved resolution of soil and outbreak data, such as 
exact carcass locations, are likely necessary to character- 
ize the role of soil parameters in promoting anthrax 



spore persistence [3,31] and for better understanding 
the spatio-temporal dynamics and ecology of local out- 
breaks [32]. 

The Al.a sub-lineage experiment predicted a more 
extensive geographic area of anthrax presence than did 
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Figure 3 Predicted geographic distribution of the Bacillus anthracis Al.a sub-lineage. Comparison of predicted geographic distributions of 
B. anthrocis. (A) distribution of B. anthrocis predicted by the sub-lineage experiment, (B) difference between predicted distributions of the sub- 
lineage and the outbreak-soil experiments. 

V J 




the outbreak experiment. This is most pronounced in 
the northern and central portions of the country. The 
median minimum values of most variables defined by 
the dominant rule sets were significantly different. 
When ecological values of limiting variables were 
extracted from geographic areas of best subset 



agreement and plotted in two dimensional variable- 
space, the Al.a sub-lineage was associated with a larger 
ecological envelope than the outbreak-soil data. This 
finding illustrates that analysis of dominant rule sets 
alone should be interpreted with some caution. The 
variable ranges derived from the dominant rule sets 
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Figure 4 Predicted geographic distribution of B. anthracis based on the large southern outbreak experiment and small southern 
outbreak experiment. Predicted geographic distribution of B. onthrocis based on (A) large southern outbreak experiment and (B) small 
southern outbreak experiment. (C) Difference between predicted geographic distributions. Green points indicate training data for input into 
GARP and yellow points indicate independent points used to evaluate model accuracy. 
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Table 3 Rules types from ten best models of the outbreak-soil and Al.a sub-lineage experiments. Values shown are 
number of rule types in the rule set (column %) 



Outbreak-Soil Rule Set 


Rule Type 


2 


6 


13 


21 


24 


25 


29 


39 


48 


51 


Total 


Log it 


17(34) 


32(64) 


34(68) 


25(50) 


22(44) 


26(52) 


36(72) 


35(70) 


25(50) 


27(54) 


279(55.8) 


Negated Range 


2(4) 


1(2) 


0(0) 


0(0) 


2(4) 


2(4) 


0(0) 


1(2) 


7(14) 


7(14) 


22(4.4) 


Range 


31(62) 


17(34) 


16(32) 


25(50) 


26(52) 


22(44) 


14(28) 


14(28) 


18(36) 


16(32) 


199(39.8) 


Al.a Sub-lineage Rule Set 


Rule Type 


1 


10 


21 


40 


49 


51 


54 


76 


91 


93 


Total 


Atomic 


1(2) 


2(4) 


1(2) 


0(0) 


0(0) 


0(0) 


0(0) 


0(0) 


0(0) 


0(0) 


4(0.8) 


Logit 


15(30) 


5(10) 


22(44) 


20(40) 


30(60) 


2(4) 


7(14) 


24(44) 


22(44) 


9(18) 


156(31.2) 


Negated Range 


10(20) 


0(0) 


4(8) 


2(4) 


0(0) 


0(0) 


0(0) 


7(14) 


0(0) 


0(0) 


23(5.6) 


Range 


24(48) 


43(86) 


23(46) 


28(56) 


20(40) 


48(96) 


43(86) 


19(38) 


28(56) 


41(82) 


317(63.4) 



ph(0 

ph (A 
CalciumfO 
Calcium(A 
Organic Content (O 
Organic Content (A 
base 5aturation(0 
base saturation (A 
Altitude (O 
Altitude (A 
Temperature Range (O 
Temperature Range (A 
Mean Temperature (O 
Mean Temperature [A 
DriestMonth (O 
DriestMonth (A 
Wettest Month (O 
Wettest Month (A 
Precipitation (O 
Precipitation (A 
NDVI Amplitude (O 
NDVI Amplitude (A 
Mean NDVI (O 
Mean NDVI (A 



0.0 



0.2 



0.4 



0.6 



0.S 



1.0 



Figure 5 Median ranges of environmental variables predicting B. anthracis presence by the outbreak-soil experiment. 0 = Outbreak-Soil 
Experiment; A = A1.a sub-lineage experiment. 
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0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 




□ Outbreak-Soil 

□ Large S. Outbreak 

□ Small S. Outbreak 




□ Outbreak-Soil 

□ Large S. Outbreak 

□ Small S. Outbreak 



Figure 6 Distribution of B. anthracis in ecological space. Predicted distribution of B. anthracis in ecological space based on areas of ten best 
subset model agreement. Red points = outbreak locations, blue points = A1.a isolate locations, black triangles = A4 locations, black circle = A3.b 
location. 



summarize approximately only one-fifth (or fewer) of 
the total number of rules generated by GARP in the ten 
best subsets. The values extracted from predicted areas 
on the landscape are derived from all 500 rules con- 
tained in the 10 best subsests and represent the spec- 
trum of complex interactions between variables and the 
landscape. Because geographic areas of model agreement 
can be thought of as representative of all sub-sampled 
regions or populations [33], the finding of a larger eco- 
logical envelope for the Al.a sub-lineage experiment 
lends support to the hypothesis that the Al.a sub- 



lineage of B. anthracis may have broad environmental 
tolerances that influence its broad geographic distribu- 
tion [3,24]. 

The A lineage is more widely distributed globally than 
other subtypes, perhaps reflecting a greater level of fit- 
ness as compared to other lineages [24]. This finding 
has been shown on a local scale as well. Isolates of the 
A lineage in Kruger National Park, South Africa, as 
defined by MLVA-8 typing, were more diffusely distrib- 
uted and showed a distinctly different spatial cluster pat- 
tern than those of the B lineage. Furthermore, the B 
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lineage isolates occupied a narrow range of available 
ecological conditions within those occupied by the A 
lineage isolates [3]. In the experiments reported here, 
the Al.a sub-lineage locations were associated with 
lower pH values than the outbreak locations and this 
could provisionally support the findings from Kruger 
National Park [3] . It follows that that a B. anthracis line- 
age and/or sub-lineage other than Al.a may predomi- 
nate in the northern regions of Kazakhstan and is 
driving the narrower geographic and ecological predic- 
tion of the outbreak-soil experiment. Genotyping of 
additional isolates from northern Kazakhstan is neces- 
sary to evaluate this hypothesis. Locations of outbreaks 
of the A4 lineage in geographical and ecological space 
suggests that this genotype may also have a relatively 
broad distribution, which is consistent with the A4 sub- 
lineage being found across the Middle East and China 
[29,34]. Although we cannot make inferences regarding 
the environmental affinities of the single A3.b isolate, 
we note it is located on the far eastern border of 
Kazakhstan in an area predicted to be unsuitable for 
anthrax by the outbreak-soil experiment. The A3.b sub- 
lineage has been isolated from geographically limited 
areas, most notably northern China and Texas, and the 
presence of this genotype is likely a result of historical 
trade routes [24,34,35]. 

The genetic diversity of B. anthracis isolates in south- 
ern Kazakhstan is not surprising given the location of 
this area along the historic Silk Road [29], but this 
diversity also implies that this region is supportive of 
spore persistence. Associations between genotypes of B. 
anthracis, environment, virulence or host species have 
not yet been fully explored and it is unknown whether 
genotype influences epidemiological characteristics of 
outbreaks. Understanding these relationships will 
improve our understanding of anthrax disease ecology, 
help focus surveillance and efficiently direct proactive 
vaccination. Furthermore, this knowledge can help dis- 
tinguish between naturally occurring outbreaks, contam- 
ination and potential bioterrorism, and greatly enhance 
epidemiological trace back (tracing outbreaks to source) 
efforts during outbreaks. 

Few studies using GARP have quantified and exam- 
ined rule types. GARP begins creating rule sets by 
choosing the first rule type at random and successful 
rules are carried forward into subsequent rule sets. 
Thus, the first rule type chosen at random will often 
predominate in the final rule sets. Blackburn and Joyner 
et. al. both presented summaries and distribution of the 
dominant rule types predicting B. anthracis in the conti- 
nental United States and Kazakhstan, respectively 
[26,32]. Blackburn noted a predominance of range rules 
among a total of 63 rules predicting greater than 90% of 
the landscape [32]. Joyner et al divided Kazakhstan into 



northern and southern halves and modelled the two sec- 
tions separately [11]. A greater percentage of range rules 
described the northern half of the country, whereas logit 
rules predominantly described the southern half of the 
country. Here, logit rules dominated in the out-break 
soil experiment and range rules in the Al.a sub-lineage 
experiment. Further work is required to tease out 
whether dominant rule types result from the stochastic 
nature of GARP or are related to complex interactions 
between the organism and environmental variables. 
Additional future work should also explore how rule 
sets and ecological values found over predicted areas of 
the landscape can be used to enhance our understand- 
ing of the ecology of an organism. 

Several authors have noted that ENM-based predic- 
tions of species with widespread distributions show 
reduced model accuracy which can be improved by 
dividing species or the range into sub-units [33,36-39]. 
One explanation for the apparent poor accuracy of 
models of widespread species is the use of AUC. The 
AUC is sensitive to the area predicted to be suitable for 
a species relative to the total land area analyzed [33,40]. 
Other considerations include non-uniformity of pre- 
sence locations (geographical bias), and biological fac- 
tors such as local ecological adaptations and genetic 
diversity [33,37-39]. For example, modelling of Franci- 
sella tularensis genotypes in the US yielded overlapping, 
yet different, geographic predictions and ecological asso- 
ciations [12]. Interestingly, this difference was apparent 
at intermediate, as opposed to coarse, phylogenetic 
levels. Similarly, Fisher et. al. showed that three geno- 
type categories of the broadly distributed pathogenic 
fungus Penicillium marneffei correlated with environ- 
mental heterogeneity across Vietnam [21]. Using GARP, 
the genotypes classes were predicted to occupy three 
non-overlapping geographic areas. As a consequence of 
B. anthracis being a widely distributed species, models 
of anthrax outbreaks are subject to similar limitations in 
accuracy. Genotype specific models may therefore have 
improved accuracy and predictive power, and should be 
explored at multiple phylogenetic levels. 

It is worthwhile to evaluate modelling limitations 
which could potentially explain differences between 
experiments. Despite studies showing GARP to be 
robust when predicting new landscapes [14,25,41], our 
use of a projected modelling strategy may wrongly pre- 
dict geographic and ecological distribution given the 
large geographic area in question. However, projected 
models created using southern outbreak points show 
similar geographic predictions as the outbreak-soil 
experiment, supporting that the broader geographic Al. 
a sub-lineage prediction is not simply an artefact of the 
modelling technique. That the large and small southern 
outbreak experiments showed a lesser degree of model 
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agreement than the outbreak-soil experiment and that 
the large outbreak subsetting procedure had improved 
spatial homogeneity over the small southern outbreaks 
subsetting indicates that projected models may be sensi- 
tive to issues of sample size and clustering. Issues relat- 
ing to sample size appear to be important in our study. 
The predicted geographic distribution of B. anthracis 
varied among the 10 random data splits for both the Al. 
a sub-lineage and small southern outbreak experiments. 
In contrast, the same random data splitting procedure 
performed with the large southern outbreak and the 
outbreak-soil experiments showed spatial homogeneity 
among models [26]. Although Stockwell and Peterson 
and Hernandez found that as few as 10 presence points 
are adequate for accurate GARP models [36,39], addi- 
tional work has shown that certain species or geographic 
scenarios are more sensitive to sample size than others 
and that model accuracy can be sensitive to both sample 
size and extent of the species' range [42]. Here, any 
effect of sample size is likely exaggerated by projection 
onto a large geographic area and the relatively limited 
resolution the data [36,42]. 

Some additional limitations apply to our findings. The 
Al.a sub-lineage isolate collection was derived from mul- 
tiple species, including livestock and humans, and from 
soil samples. The outbreak data, however, were derived 
from livestock only. The impact of this on model com- 
parisons is unknown since associations between host, 
genotype and environment are as yet unexplored. We are 
limiting our modelled area by political boundaries as 
opposed to biogeographic limits. Finally, the genotyped 
isolate collection is geographically biased towards the 
southeast portion of Kazakhstan and spans a relatively 
long period of time [29]. Sampling bias is common in 
niche models using historical collections and can create 
artificial patterns in the data, although GARP is arguably 
less sensitive than other modelling algorithms to spatial 
bias [10,37,43,44]. Future genotyping of additional iso- 
lates from under-sampled portions of the country, parti- 
cularly the northern oblasts, will be essential to better 
characterize the genetic diversity and ecology of anthrax 
in Kazakhstan, allowing the construction of more refined 
predictive models. 

Conclusions 

The inclusion of available soil variables resulted in 
subtle changes in the predicted geographic distribution 
of anthrax in Kazakhstan, but the experiment is limited 
by the nature of available soil variables. Standardized 
soil variables and finer resolution data will be essential 
to characterizing the importance of soil parameters in B. 
anthracis persistence. The Al.a sub-lineage experiment 
showed a larger geographic and ecological distribution 
than the outbreak based experiment. Understanding 



genetic-environmental associations will be essential to 
accurate modelling of anthrax for use in disease preven- 
tion and control in Kazakhstan 

Methods 

Anthrax occurrence data 

Presence points for outbreak-based models were taken 
from the database created by Joyner et.al. [26]. Briefly, 
historical records were used to construct a database of 
3,947 anthrax outbreaks reported in Kazakhstan between 
1937 and 2006. The data were sequentially filtered to 
create a dataset containing the latitude and longitude of 
outbreaks in cattle, sheep and goats which occurred 
between 1960 and 2000. This time period reflects the 
implementation of mass vaccination and corresponds to 
the averaged data from both the WorldClim and soils 
data sets [45,46]. The final dataset contained 258 spa- 
tially unique points, meaning that only one outbreak 
point occurred in each 8 km 2 pixel. An 8 km 2 resolution 
was chosen because outbreaks were mapped to the near- 
est village and some outbreaks occurred greater than 1 
km from the village coordinates. This data set is here- 
after referred to as the full outbreak dataset. 

A second dataset was constructed using outbreak iso- 
lates genotyped by Aikimbayev et.al. [29]. Isolates were 
grouped into the Al.a (n = 78), A3.b (n = 6) and A4 (n 
= 4) sub-lineages using unweighted pair group method 
with arithmetic mean (UPGMA) cluster analysis and the 
89 B. anthracis genotypes identified by Keim et.al. [22]. 
This was filtered to contain only spatially unique points 
at a resolution of 8 km 2 resulting in 42 spatially unique 
points, of which 39 were Al.a, two were A4 and 1 was 
A3 b. Locality data are mapped in Figure 1. Only the 
Al.a sub-lineage had an adequate number of spatially 
unique points for modelling. The Al.a locations were 
geographically biased towards the south-eastern portion 
of the country. To reduce the problem of largely 
unsampled areas being considered as absence points, we 
created a polygon encompassing southeast Kazakhstan 
using latitude 48 N and 60 E as the northern and wes- 
tern boundaries, respectively, and the country's political 
boundaries in the south and east. The boundaries of this 
southern polygon were derived from examination of the 
locations of the Al.a isolates and from the different 
northern and southern ecological associations noted by 
Joyner [26]. The southern polygon was used to clip the 
Al.a locality points in ArcMap and this set of southern 
Al.a locations was used for the sub-lineage experiment. 
The same procedure was used to create a southern out- 
break dataset. 

Environmental Data 

We used six environmental coverages downloaded from 
the WorldClim website http://www.worldclim.org [46]. 
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The WorldClim variables are calculated from interpola- 
tion of monthly temperature and precipitation measure- 
ments recorded at stations located worldwide between 
1961 and 2000. Monthly values are transformed by 
WorldClim into 19 bioclimatic variable grids that 
describe annual trends, seasonality and potentially limit- 
ing ecological parameters such as temperature of the 
coldest and warmest months. Two satellite-derived 
environmental variables describing temperature and 
vegetation measures were obtained from the Trypanoso- 
miasis and Land Use in Africa (TALA) research group 
(Oxford, United Kingdom) [47]. 

We added four soil variables to the set of eight envir- 
onmental layers used in the previous study to test the 
influence of soil parameters on the outbreak model. Soil 
variables were derived from the Harmonized World Soil 
Database and were available at 1 km 2 resolution [45]. 
All coverages were re-sampled to 8 km 2 and clipped to 
the boundaries of Kazakhstan in Arc View 3.3. (Environ- 
mental Systems Research institute, Redlands, CA). An 
identical set of coverages was clipped to the southern 
polygon. The final set of coverages is given in Table 4. 

Ecological Niche Modelling 

This study used the Genetic Algorithm for Rule-Set Pre- 
diction (GARP) to perform the ecological niche model- 
ling [48]. Models were developed in Desktop GARP 
v. 1.1.3, which gives the user the option to write out the 
rule sets for each model. Briefly, GARP is a presence 
only modelling technique that detects non-random asso- 
ciations between species localities and specific environ- 
mental variables. Through an iterative process, 
relationships are expressed as a series of logic state- 
ments, or rules, of which there are four types: (1) logit - 
based on logistic regression; (2) atomic - single value for 



Table 4 Environmental coverages used for GARP models 



Environmental Variable (unit) 


Name 


Source 


Elevation (m) 


Altitude 


WorldClim* 


Annual Temperature Range (°C) 


BI07 


WorldClim 


Annual Mean Temperature (°C) 


BI01 


WorldClim 


Precipitation of Driest Month (mm) 


BI014 


WorldClim 


Precipitation of Wettest Month (mm) 


BI013 


WorldClim 


Annual Precipitation (mm) 


BI012 


WorldClim 


NDVI Amplitude (no units) 


wd1014 a1 


TALAt 


Mean NDVI (no units) 


wd1014 aO 


TALA 


Soil pH (-log(H + )) 




HWSD+ 


Topsoil Calcium (% weight) 




HWSD 


Topsoil Organic Content (% weight) 




HWSD 


Subsoil Base Saturation (%) 




HWSD 



*http://www.worldclim.org[46] 

t Trypanosomiasis and Land Use in Africa (TALA) research group [47] 
tHarmonized World Soil Database [45] 



a given variable that predicts presence; (3) range - a 
range of values of a given variable that predicts pre- 
sence; and (4) negated range - a range of values outside 
of which presence is predicted. Each individual GARP 
model is a set of 50 rules that are randomly generated, 
tested and modified. The user sets a maximum number 
of models to be created in a single experiment. A best 
subsets procedure within GARP then selects a set of 
optimal models based on user defined omission and 
commission criteria [48]. The algorithm is a two-step 
process, where first relationships are defined in variable 
space through a random walk and then applied to the 
geographic landscape where those conditions are met 
[25]. GARP therefore has the benefit of being able to 
project rule sets onto the environmental layers of a dif- 
ferent landscape and has been shown to be robust in 
this application [25,41]. 

Model building and evaluation 

To test the effect of soils on the outbreak experiment 
we used the full outbreak dataset as locality points and 
the twelve environmental variables described in Table 3. 
The 258 spatially unique points were randomly divided 
into an 85% (n = 218) training set used for model build- 
ing and a 15% (n = 39) testing set for model evaluation. 
The 32 southern Al.a points were divided into an 80% 
(n = 26) training set and a 20% (n = 6) testing set in 
order to maximize points available for testing [33,39,42]. 
The Al.a training set was input into GARP with the set 
of environmental coverages clipped to the southern 
polygon for model development. Rules from this south- 
ern Al.a experiment were projected onto the entire 
landscape of Kazakhstan. In order to test the robustness 
of the Al.a model projections given the relatively small 
sample size and issues of transferability, two experi- 
ments using the southern outbreak data were per- 
formed. The first utilized all 142 southern outbreak 
points and is referred to as the large southern outbreak 
experiment. For the second, 32 points were randomly 
selected from the southern outbreak dataset (small 
southern outbreak experiment). For these experiments 
an 85%/15% and 80%/20%, respectively, external data 
split was performed and the experiments conducted as 
for the sub-lineage. The four experiments are summar- 
ized in Table 4. 

For all niche modelling experiments, we specified 200 
models with a maximum of 1,000 iterations and a con- 
vergence limit of 0.01. The training data were input into 
GARP with a 50% training/ 50% testing internal data par- 
tition. The best subset procedure selected the best 20 
models under a 10% hard omission threshold and a 50% 
commission threshold. The resulting ten best subset 
models were imported in ArcGIS and summated using 
the raster calculator function of the Spatial Analyst 



Mullins et al. BMC Ecology 201 1, 11:32 
http://www.biomedcentral.eom/1 472-6785/1 1 /32 



Page 13 of 14 



extension. This created a single cumulative raster file of 
model agreement for B. anthracis presence ranging 
from 0 (all models predict absence) to 10 (all models 
predict presence). The more models that predict pre- 
sence for a given pixel, the higher the likelihood that 
the pixel can support B. anthracis. 

Rule types from the ten best models of the outbreak- 
soil experiment and the Al.a sub-lineage experiment 
were extracted with a python script (K.M. McNyset, US 
NOAA) and summarized to illustrate the relative num- 
bers of each rule type. Dominant rules, or the subset of 
rules that together predict over 90% of the landscape, 
for each model were identified. We extracted the mini- 
mum and maximum values of range rules using the 
python script. When logit rules were identified, we 
extracted the range of values across the pixels predicted 
by that rule using the "Extract Values to Points" routine 
of the Spatial Analyst extension in ArcMap. Median 
minimum and maximum values for each variable were 
calculated in SAS (SAS 9.2, Cary, N.C.) and plotted as a 
bar graph. Differences in median and maximum values 
between experiments were assessed using Wilcoxon- 
Mann_Whitney test in SAS. 

Predictive performance of the best subset models was 
evaluated with an area under the curve (AUC) in a 
receiver operating characteristic (ROC) analysis using 
the independent test data withheld from the original 
datasets [40]. For the projected models, testing points 
included presence points outside the southern polygon 
in addition to points withheld from within the southern 
polygon. Values of AUC, which range from 0.5 (no dif- 
ferent from random) to 1 (a perfect model), are derived 
from measures of sensitivity (absence of omission error) 
and specificity (absence of commission error). The cal- 
culated value is compared to that of a random model 
using a z-test. In addition, measures of omission and 
commission were calculated using the summed ten best 
subset models. Total and average omission values evalu- 
ate how well GARP predicts the presence of known 
locality points not included in the model building data. 
Total and average commission is the percent of pixels 
predicted as presence by the summated model and the 
average of this value for all ten best subset models, 
respectively. Large variation between the two measures 
of commission suggests substantial variation between 
the proportions of the landscape predicted present by 
each of the ten best subset models [33]. 

Environmental values of 5,000 randomly chosen points 
from areas predicted by all 10 of the best subset models 
were extracted using the "Extract Values to Points" rou- 
tine of the Spatial Analyst extension in ArcMap. Values 
of each environmental variable at each presence point 
and at 5,000 random points representing the total avail- 
able environmental space (background) were similarly 



extracted. Specific environmental values appearing to be 
limiting factors for prediction of B. anthracis were cho- 
sen based on the rule set evaluation (Figure 5) and 
visualized in 2-dimensional ecological space against the 
background of available environmental conditions using 
R 2.1.1 http://www.R-project.org. 

Additional material 



Additional file 1: Random Subsets. Predicted geographic distribution 
of B. anthracis based on 10 random subsets of input locality points for 
the Aa.1 sub-lineage, large southern outbreak and small southern 
outbreak experiments. 

Additional file 2: Accuracy metrics for random subsets. Accuracy 
metrics of 10 random subsets of input locality points for the Aa.1 sub- 
lineage, large southern outbreak and small southern outbreak 
experiments 

Additional file 3: Soil Variables. Mapped values of the four soil 
variables (minimum soil pH, minimum soil organic content, minimum soil 
calcium and minimum soil base saturation). 
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